{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 26 - Permutation tests\n", "\n", "In Lab 5 we used confidence intervals to compare the means of two groups to determine if they were different. In this lab, we'll use hypothesis testing to compare the means.\n", "\n", "First, we'll download the green taxi trip data for Dec. 5, 2018.\n", "\n", "The NYC Open Data dataset of all 2018 green taxi trips is here: [https://data.cityofnewyork.us/Transportation/2018-Green-Taxi-Trip-Data/w7fs-fd9i](https://data.cityofnewyork.us/Transportation/2018-Green-Taxi-Trip-Data/w7fs-fd9i)\n", "\n", "The dataset contains almost 9 million rows, so we will filter the data to include only trips from Dec. 5, 2018 to make the dataset smaller. To do this:\n", "- Click on the \"Filter\" button.\n", "- On the menu that appears, click on \"Add a New Filter Condition\".\n", "- Choose \"lpep_pickup_datetime\" but change the \"is\" to be \"is between\".\n", "- Click in the box below and a calendar will pop up. Highlight December 5, 2018.\n", "- Click the second box below and a calendar will pop up. Highlight December 6, 2018.\n", "- It will take a few seconds (it's a large file) but the rows on the left will be filtered to be all trips with pickups between Dec. 5 2018 at 12am and Dec. 6 2018 at 12am, that is, all trips with pickups on Dec. 5.\n", "\n", "To download the file,\n", "- Click on the \"Export\" button.\n", "- Under \"Download\", choose \"CSV\".\n", "- The download will begin automatically (files are usually stored in the \"Downloads\" folder).\n", "\n", "First, let's import the necessary libraries." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparing mean trip distances for trips with 1 passenger or more than 1 passenger\n", "\n", "Is there a difference in the mean trip distance between trips taken with only 1 passenger and trips taken with more than 1 passenger?\n", "\n", "We will test this hypothesis.\n", "\n", "#### Hypothesis testing step 1\n", "\n", "Null hypothesis: The mean trip distance for trips with only 1 passenger is the same as the mean trip distance for trips with 2 or more passengers.\n", "\n", "Alternative hypothesis: The mean trip distance for trips with only 1 passenger is different from the mean trip distance for trips with 2 or more passengers.\n", "\n", "Before proceeding further, load the green taxi trip data into the dataframe `taxi`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Hypothesis testing step 2\n", "\n", "Our test statistic will be the difference in mean trip distance between trips with only 1 passenger and trips with 2 or more passengers. To calculate the test statistic for the data:\n", "\n", "1. Compute the mean trip distance when there is only 1 passenger.\n", "2. Compute the mean trip distance when there are 2 or more passengers.\n", "3. Subtract mean 1 from mean 2 and take the absolute value.\n", "\n", "Let's do step 1: compute the mean trip distance when there is only 1 passenger.\n", "\n", "
Hint:\n", "a. Use a filter to create a new dataframe containing only trips with 1 passenger.
\n", "b. Compute the mean trip distance in the new dataframe.\n", "
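One way to follow the hint is sketched below on a tiny made-up dataframe, not the real CSV. The column name `trip_distance` is an assumption (check `taxi.columns` for your file), `taxi_demo` is a hypothetical name, and the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the values are made up.
# 'trip_distance' is an assumed column name, so check taxi.columns in your file.
taxi_demo = pd.DataFrame({
    'passenger_count': [1, 1, 2, 1, 3, 2],
    'trip_distance': [1.0, 3.0, 2.0, 2.0, 4.0, 6.0],
})

# a. Filter: keep only the rows where there is exactly 1 passenger.
one_passenger = taxi_demo[taxi_demo['passenger_count'] == 1]

# b. Compute the mean trip distance in the filtered dataframe.
mean_one = one_passenger['trip_distance'].mean()
print(mean_one)
```

The same pattern with `>= 2` in the filter should give the mean for trips with 2 or more passengers.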
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Step 2: Compute the mean trip distance when there are 2 or more passengers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Step 3: Subtract mean 1 from mean 2 and take the absolute value.\n", "\n", "This value is the test statistic for our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Hypothesis testing step 3\n", "Step 3 is to simulate the test statistic assuming the null hypothesis is true.\n", "\n", "We will do this by permuting (randomly changing) the passenger count data in the dataframe, without changing any other columns. If the passenger count doesn't matter, then switching it around shouldn't change the difference in means. \n", "\n", "First let's make a new dataframe called `permuted_taxi` by loading the data from the CSV file again." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code will permute the `passenger_count` column and then display the new dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "permuted_taxi['passenger_count'] = np.random.permutation(permuted_taxi['passenger_count'])\n", "permuted_taxi.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the first few rows of `permuted_taxi` with the first few rows of `taxi`. 
Some of the `passenger_count` values should have changed.\n", "\n", "Compute the difference between the mean trip distance with 1 passenger and the mean trip distance with 2 or more passengers using the `permuted_taxi` dataframe. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's repeat these steps (permuting the `passenger_count` column and computing the difference between the two means in the permuted dataframe) many times, storing the mean differences in a list.\n", "\n", "Remember to use a small number of iterations while testing your code so that it runs faster." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Hint:\n", "The pseudo-code is:\n", "\n", "create an empty list\n", "loop 10,000 times:\n", " randomly permute the passenger count column\n", " compute the mean trip distance for trips with only 1 passenger\n", " compute the mean trip distance for trips with 2 or more passengers\n", " compute the absolute difference between the two means\n", " store the difference in your list\n", "\n", "
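The pseudo-code above could be turned into Python roughly as follows. This is only a sketch on randomly generated stand-in data, not the real taxi file: `trip_distance` is an assumed column name, `demo` and `n_iterations` are hypothetical names, and the iteration count is kept small so the loop runs quickly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical stand-in data; 'trip_distance' is an assumed column name.
demo = pd.DataFrame({
    'passenger_count': rng.integers(1, 5, size=200),
    'trip_distance': rng.exponential(2.0, size=200),
})

n_iterations = 100  # small while testing; raise it once the code works
differences = []    # create an empty list

for _ in range(n_iterations):
    # Randomly permute the passenger counts; the distances stay in place.
    demo['passenger_count'] = rng.permutation(demo['passenger_count'].to_numpy())

    # Mean trip distance for trips with only 1 passenger.
    mean_one = demo.loc[demo['passenger_count'] == 1, 'trip_distance'].mean()

    # Mean trip distance for trips with 2 or more passengers.
    mean_multi = demo.loc[demo['passenger_count'] >= 2, 'trip_distance'].mean()

    # Absolute difference, matching the test statistic from step 2.
    differences.append(abs(mean_multi - mean_one))
```

With the real data, the same loop would run on `permuted_taxi`, and `differences` is the list to pass to `plt.hist`.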
\n", "\n", "Graph the histogram of the differences in means that you computed assuming the null hypothesis is true." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Hypothesis testing step 4\n", "Compare the difference in means from the data with the histogram. Does your data test statistic look like it comes from the histogram distribution?\n", "\n", "Reject or fail to reject the null hypothesis.\n", "\n", "If you have time, create and test another hypothesis for the green taxi trip data." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }